Background

CarPrices is a data set containing 80 samples of Cadillac cars. We will look specifically at Price as a function of Mileage for the Deville model, and examine how the Deville's Trim style influences price. We will use linear regression to model the data, first without Trim as a factor and then with Trim included.

Price vs Mileage

Let’s take a look at price vs mileage of the Cadillac Deville model in the dataset, with a single linear regression line added to the chart. Notice how the data points are colored according to trim type.

library(plotly)  # needed for plot_ly()

# Assuming 'Deville' is your dataframe containing the Cadillac Deville data

# Fit a simple linear regression of Price on Mileage
deville_lm <- lm(Price ~ Mileage, data = Deville)

# Extract the intercept and slope
b <- coef(deville_lm)

p <- plot_ly(data = Deville, x = ~Mileage, y = ~Price, type = "scatter", mode = "markers", 
             color = ~Trim,
             text = ~paste("Trim: ", Trim)) %>%
  layout(title = "Price vs Mileage",
         xaxis = list(title = "Mileage"),
         yaxis = list(title = "Price"))

# Add regression line to the plot
p <- add_trace(p, x = Deville$Mileage, y = b[1] + b[2]*Deville$Mileage, 
               type = "scatter", mode = "lines", line = list(color = "green"))

# Print the plot
p

Hypothesis and Model Comparison

Visually, according to trim color, the data is almost split into two bands. We can anticipate that two lines might serve as a better model for the data.

Below is the model for simple linear regression:

\[ Y_i = \beta_0 + \beta_1 X_{1i} + \epsilon_i \text{, where } \epsilon_i \sim N(0, \sigma^2) \]

Below is the equation for a two lines reduced linear regression model:

\[ \underbrace{Y_i}_\text{Price} = \underbrace{\beta_0 + \beta_1X_{1i} + \beta_2X_{2i}}_{\text{E}\{Y_i\}} + \epsilon_i \]

\[ X_{2i} = \begin{cases} 1 & \text{if Trim = DHS Sedan 4D or DTS Sedan 4D} \\ 0 & \text{if Trim = Sedan 4D} \end{cases} \]

Note that \(\text{E}\{Y_i\}\) is our expected value, while the error term models the points being distributed normally around the expected value.
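
Substituting the two possible values of \(X_{2i}\) makes the "two lines" structure explicit: the indicator shifts only the intercept, so both trim groups share the slope \(\beta_1\).

\[ \text{E}\{Y_i\} = \begin{cases} \beta_0 + \beta_1 X_{1i} & \text{if } X_{2i} = 0 \text{ (Sedan 4D)} \\ (\beta_0 + \beta_2) + \beta_1 X_{1i} & \text{if } X_{2i} = 1 \text{ (DHS or DTS Sedan 4D)} \end{cases} \]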

Our two cases, represented in the dataset as 'Trim_Case', split the data according to whether the trim is DHS Sedan 4D or DTS Sedan 4D, or whether it is plain Sedan 4D (without DHS or DTS).

We use the reduced version of the two lines model because the third term, \(\beta_3 X_{1i} X_{2i}\), would represent a change in slope based on Trim case; however, the p-value for this interaction term was 0.2175, meaning it was not significant in the model.
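
For reference, the full (standard) two lines model includes that interaction term, allowing each trim group its own slope:

\[ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \epsilon_i \]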

Our null hypothesis is that the two lines reduced model will not show a significant increase in adjusted \(R^2\), nor in the significance of the model's intercept or slope terms. The alternative hypothesis is that there will be a significant improvement in at least one of these.

Two Lines Model

Let’s take a look visually at the two lines model:

# Create the indicator variable Trim_Case (X2) from Trim
Deville <- Deville %>%
  mutate(
    Trim_Case = case_when(
      Trim %in% c("DHS Sedan 4D", "DTS Sedan 4D") ~ 1,
      Trim == "Sedan 4D" ~ 0
    )
  )


# Fit linear models
lm_trim <- lm(Price ~ Mileage + Trim_Case, data = Deville)



# Get coefficients
bd <- coef(lm_trim) 

# Create scatter plot
p <- plot_ly(data = Deville, x = ~Mileage, y = ~Price, type = "scatter", mode = "markers", 
             color = ~Trim,
             text = ~paste("Trim: ", Trim)) %>%
  layout(title = "Price vs Mileage",
         xaxis = list(title = "Mileage"),
         yaxis = list(title = "Price"))

# Add regression lines
p <- add_trace(p, x = Deville$Mileage, y = bd[1] + bd[2]*Deville$Mileage, 
               type = "scatter", mode = "lines", line = list(color = "lightblue"))

p <- add_trace(p, x = Deville$Mileage, y = (bd[1] + bd[3]) + bd[2]*Deville$Mileage, 
               type = "scatter", mode = "lines", line = list(color = "pink"))

p  # Print the plot

Visually, we can see that these two lines seem to be a better fit for the data. Let’s see how the models compare based on charts below. The first set of charts are for the single line model, and the second set is for the two lines model.

# Diagnostic plots for the one line model, in a 1x3 grid
# (qqPlot is from the car package)
library(car)
par(mfrow=c(1,3))
plot(deville_lm, which=1)
qqPlot(deville_lm$residuals, id=FALSE)
plot(deville_lm$residuals)

# Diagnostic plots for the two lines model
par(mfrow=c(1,3))
plot(lm_trim, which=1)
qqPlot(lm_trim$residuals, id=FALSE)
plot(lm_trim$residuals)

The residuals vs fitted and qqPlot charts do not look all that different between the models. However, the residuals index chart for the two lines model appears much less patterned and more evenly distributed. These charts alone do not tell us which model is better; notably, they suggest the one line model satisfies the assumptions of linear regression, even though it proves not to be ideal.
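
A formal complement to these diagnostic charts is a partial F-test between the nested models. A minimal sketch, assuming deville_lm and lm_trim have been fitted as above:

# Partial F-test: does adding Trim_Case significantly reduce residual error?
# The one line model is nested in the two lines reduced model,
# so anova() can compare them directly.
anova(deville_lm, lm_trim)

A small p-value on the second row indicates that the Trim_Case term significantly improves the fit.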

Statistical Accuracy Comparison

Let’s compare the models based on the statistical significance of their terms, and the r-squared values. Below is a table comparing those items:

# Full (standard) two lines model, including the Mileage:Trim_Case interaction
lm_standard <- lm(Price ~ Mileage + Trim_Case + Mileage:as.factor(Trim_Case), data=Deville)

# Function to extract p-values, padding with NA when a model has fewer coefficients
extract_p_values <- function(model, num_coefficients) {
  p_values <- rep(NA, num_coefficients)
  summary_p_values <- summary(model)$coefficients[, "Pr(>|t|)"]
  p_values[1:length(summary_p_values)] <- summary_p_values
  return(p_values)
}

# Extract p-values and R-squared for each model
p_values_deville <- extract_p_values(deville_lm, 3)    # deville_lm has 2 coefficients; pad to 3
r_squared_deville <- summary(deville_lm)$r.squared

p_values_trim <- extract_p_values(lm_trim, 4)          # lm_trim has 3 coefficients; pad to 4
r_squared_trim <- summary(lm_trim)$r.squared

p_values_standard <- extract_p_values(lm_standard, 4)  # lm_standard has 4 coefficients
r_squared_standard <- summary(lm_standard)$r.squared

# Create a data frame to store the p-values and R-squared values
p_values_df <- data.frame(
  Model = c("Standard Linear Model", "Two Lines Reduced Model", "Two Lines Standard Model"),
  Intercept = c(p_values_deville[1], p_values_trim[1], p_values_standard[1]),
  Mileage = c(p_values_deville[2], p_values_trim[2], p_values_standard[2]),
  Trim_Case = c(p_values_deville[3], p_values_trim[3], p_values_standard[3]),
  Interaction = c(NA, p_values_trim[4], p_values_standard[4]),  # Set NA for the models without Interaction term
  R_squared = c(r_squared_deville, r_squared_trim, r_squared_standard)
)

# Display the p-values and R-squared table
pander(p_values_df, caption = "P-values and R-squared for Coefficients")
P-values and R-squared for Coefficients

Model                      Intercept   Mileage     Trim_Case   Interaction   R_squared
Standard Linear Model      1.706e-23   0.0003624   NA          NA            0.37
Two Lines Reduced Model    3.658e-36   2.994e-16   1.471e-16   NA            0.9515
Two Lines Standard Model   8.932e-29   6.191e-08   2.688e-08   0.2175        0.9543

Here we can see the important terms of each model and their statistical significance. Note how, even though the two lines standard model has a slightly higher R-squared value, its third (interaction) term is not statistically significant, and both its \(X_1\) and \(X_2\) terms are less statistically significant than the corresponding terms in the two lines reduced model. The two lines reduced model has a substantially higher R-squared value than the standard linear model, and each of its terms is statistically significant.

These findings support our alternative hypothesis: in fact, every factor is statistically more significant, and R-squared increases substantially. We can therefore reject the null hypothesis and conclude that the two lines reduced model is a better fit for the data set.
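
Since the hypothesis was stated in terms of adjusted \(R^2\), which penalizes the model for each added predictor, it is worth checking that statistic directly as well. A minimal sketch, assuming the models fitted above:

# Adjusted R-squared accounts for the number of predictors,
# so an increase here is not merely an artifact of adding a term
summary(deville_lm)$adj.r.squared  # one line model
summary(lm_trim)$adj.r.squared     # two lines reduced model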

Conclusion

We saw that the two lines model was a better fit for the data, and that the slope and intercept terms were more statistically significant in the two lines model than in the one line model. Because the interaction term in the two lines standard model was statistically insignificant, we used the two lines reduced model. Due to its more statistically significant terms, we were able to reject the null hypothesis in favor of the alternative hypothesis.